This case study follows the six-step data analysis process:

1. Ask

Business Task: Analyze Fitbit data to gain insight and help guide marketing strategy for Bellabeat to grow as a global player.

2. Prepare

The data can be found on Kaggle. The notebook consists of 18 data sets. To assess the credibility and integrity of the data, I will use the “ROCCC” system.

2a. Reliability

  • This data has low reliability. The sample consists of only about 30 participants (33 distinct IDs appear in the daily activity data), which limits the analysis that can be done.

2b. Originality

  • This data is low on originality, as the participants who submitted data were Fitbit users.

2c. Comprehensiveness

  • This data is moderately comprehensive. The data collected matches the parameters Bellabeat needs, but more information about the participants (age, race, health status) would greatly improve any analysis produced.

2d. Current

  • This data is not current. It is 5 years old, and knowledge about health and exercise has advanced in that time, which may make some of the analysis obsolete.

2e. Cited

  • This data is poorly cited. It was collected through a third party, so the original source is unknown.

3. Process

Setting up my environment

Notes: Setting up my environment by installing and loading the ‘tidyverse’, ‘skimr’, ‘plotly’, ‘dplyr’, and ‘ggplot2’ packages.

install.packages('tidyverse')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages('skimr')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages('plotly')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages('dplyr')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages('ggplot2')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(readr)
library(dplyr)

Importing Data Sets

Notes: The “Daily Activity” and “Sleep Day” data sets have been selected for our data analysis

# Import the DailyActivity and SleepDay data sets
daily_activity <- read.csv("Fitabase/dailyActivity_merged.csv")
sleep_activity <- read.csv("Fitabase/sleepDay_merged.csv")

Inspecting and Cleaning Data Sets

Notes: I will inspect the data sets, count the distinct users in each, and check for duplicate entries and remove them.

head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(sleep_activity)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
# Inspecting column names
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleep_activity)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
# Checking the total number of users for each data set. We see we have 33 IDs for daily_activity and 24 IDs for sleep_activity
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_activity$Id)
## [1] 24
# Checking for duplicate entries
nrow(daily_activity[duplicated(daily_activity),])
## [1] 0
nrow(sleep_activity[duplicated(sleep_activity),])
## [1] 3
# Removing duplicated entries
sleep_activity <- unique(sleep_activity)
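Duplicates are only one kind of data-quality problem; missing values are another. A minimal sketch of an NA check, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the Fitabase files), might look like this:

```r
# Hypothetical toy data standing in for one of the imported data sets
toy <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(13162, NA, 10460, 9762),
  Calories = c(1985, 1797, NA, 1745)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

The same two calls could be applied to `daily_activity` and `sleep_activity` in place of `toy`.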

Transform Data Sets

Notes: I will transform data in the data sets to help with visualizing the data

# Separate the date and time in sleep_activity
# (extra = "merge" keeps "12:00:00 AM" together in Time instead of discarding "AM")
sleep_activity <- sleep_activity %>% separate(SleepDay, c("Date", "Time"), sep = " ", extra = "merge")
# Create a weekday column for daily_activity
daily_activity <- daily_activity %>% mutate(Weekday = weekdays(as.Date(ActivityDate, "%m/%d/%Y")))

# Putting the weekdays in order from Monday through Sunday
daily_activity$Weekday <- factor(daily_activity$Weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
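`weekdays()` returns day names in the session's locale, so English factor levels would produce NA values on a non-English system. A sketch of a locale-independent alternative, assuming the same `%m/%d/%Y` date format (the `dates` vector is a small illustrative sample), derives the ISO weekday number and attaches the English labels explicitly:

```r
# Sample dates in the same format as ActivityDate
dates <- c("4/12/2016", "4/16/2016", "4/17/2016")

# "%u" gives the ISO weekday number: 1 = Monday, ..., 7 = Sunday
day_number <- as.integer(format(as.Date(dates, format = "%m/%d/%Y"), "%u"))

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Build the factor from the numbers, so the labels do not depend on the locale
weekday <- factor(day_number, levels = 1:7, labels = day_names)
print(weekday)
```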

4. Analyze

# Summary statistics for each data set
daily_activity %>% select(TotalSteps, TotalDistance, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, Calories) %>% summary()
##    TotalSteps    TotalDistance    SedentaryMinutes LightlyActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :  0.0       
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:127.0       
##  Median : 7406   Median : 5.245   Median :1057.5   Median :199.0       
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :192.8       
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:264.0       
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :518.0       
##  FairlyActiveMinutes VeryActiveMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.00    Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:  0.00    1st Qu.:1828  
##  Median :  6.00      Median :  4.00    Median :2134  
##  Mean   : 13.56      Mean   : 21.16    Mean   :2304  
##  3rd Qu.: 19.00      3rd Qu.: 32.00    3rd Qu.:2793  
##  Max.   :143.00      Max.   :210.00    Max.   :4900
sleep_activity %>% select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0

5. Share

Creating Visualizations

Graph 1: TotalSteps vs Weekday

This graph shows the days users recorded the most steps.

# Bar graph on how many steps users recorded throughout the week
ggplot(data = daily_activity, aes(x = Weekday, y = TotalSteps, fill = Weekday)) +
    geom_bar(stat = "identity") +
    labs(title = "Total Steps Recorded", subtitle = "Users' steps are categorized by the day of the week.", x = "Day of the Week", y = "Total Steps")

This graph shows that users recorded the most steps midweek (Tuesday through Thursday). This may be due to their occupation (for example, nurse or kitchen worker) or their mode of transportation.
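Because `geom_bar(stat = "identity")` stacks every daily record, each bar is a sum, and a weekday with more recorded days will look taller even if users were no more active on it. One way to check the pattern, sketched here on a hypothetical toy data frame (`toy` and its values are made up), is to compare the mean steps per record for each weekday:

```r
library(dplyr)

# Hypothetical records: Tuesday has more rows than Saturday
toy <- data.frame(
  Weekday = c("Tuesday", "Tuesday", "Tuesday", "Saturday"),
  TotalSteps = c(6000, 8000, 7000, 9000)
)

# Number of records, summed steps, and mean steps for each weekday
steps_by_day <- toy %>%
  group_by(Weekday) %>%
  summarise(
    records = n(),
    total_steps = sum(TotalSteps),
    mean_steps = mean(TotalSteps)
  )
print(steps_by_day)
```

In this toy example Tuesday has the larger total (21,000 versus 9,000) but the smaller mean (7,000 versus 9,000), which is the distinction a bar chart of sums cannot show.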

Graph 2: Percentage of Activity Levels

This graph shows how the users’ recorded minutes are divided among the activity levels.

# Pie Chart to compare the users' activity levels

# Calculating the total minutes of activity
total_minutes <- sum(daily_activity$VeryActiveMinutes, daily_activity$FairlyActiveMinutes,daily_activity$LightlyActiveMinutes, daily_activity$SedentaryMinutes)

# Calculating the percentage of each activity level
active_minutes_percentage <- sum(daily_activity$VeryActiveMinutes)/total_minutes*100
fairlyActive_minutes_percentage <- sum(daily_activity$FairlyActiveMinutes)/total_minutes*100
lightlyActive_minutes_percentage <- sum(daily_activity$LightlyActiveMinutes)/total_minutes*100
sedentary_minutes_percentage <- sum(daily_activity$SedentaryMinutes)/total_minutes*100

# Creating a pie chart visualization with plotly
percentage <- data.frame(
  level = c("Sedentary", "Lightly", "Fairly", "Very Active"),
  minutes = c(sedentary_minutes_percentage,lightlyActive_minutes_percentage,fairlyActive_minutes_percentage,active_minutes_percentage)
)

plot_ly(percentage, labels = ~level, values = ~minutes, type = 'pie', textposition = 'outside', textinfo = 'label+percent') %>% layout(title = 'Activity Level Minutes', xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

We see that users spend 81% of their recorded time at a sedentary level. Only 19% of the time was recorded as some form of activity.
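As a rough cross-check, the shares can be recomputed from the column means printed in the summary in step 4. Every column has the same number of rows, so ratios of means equal ratios of sums; the means are rounded in that output, so the result is approximate:

```r
# Column means copied from the summary() output in step 4
mean_minutes <- c(
  Sedentary = 991.2,
  Lightly = 192.8,
  Fairly = 13.56,
  `Very Active` = 21.16
)

# Share of recorded minutes at each activity level, in percent
share <- round(mean_minutes / sum(mean_minutes) * 100, 1)
print(share)

# Combined share of the three active levels, in percent
active_share <- round(100 - mean_minutes[["Sedentary"]] / sum(mean_minutes) * 100, 1)
print(active_share)
```

The four shares sum to 100%, so the sedentary share and the combined active share must do so as well.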

Graph 3: Very Active Minutes vs Calories

This scatter plot shows the relationship between the users’ recorded very active minutes and calories burned.

# Create a scatter plot to show the relationship between Very Active Minutes and Calories
ggplot(data = daily_activity) + geom_point(mapping = aes(x = VeryActiveMinutes, y = Calories)) + geom_smooth(mapping = aes(x = VeryActiveMinutes, y = Calories)) + labs(title = "Very Active Minutes vs Calories Burned", x = "Active Minutes", y = "Calories Burned")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We see a positive correlation between very active minutes and calories burned: the more very active minutes a user records, the more calories they tend to burn.
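The strength of this relationship can be quantified with a Pearson correlation coefficient. The sketch below uses only the six rows shown by `head(daily_activity)` so that it runs on its own; six days from a single user are not representative, so the value it prints says nothing about the full data set. On the real data the call would be `cor(daily_activity$VeryActiveMinutes, daily_activity$Calories)`.

```r
# Six rows copied from the head(daily_activity) output (one user only)
sample_rows <- data.frame(
  VeryActiveMinutes = c(25, 21, 30, 29, 36, 38),
  Calories = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one
r <- cor(sample_rows$VeryActiveMinutes, sample_rows$Calories)
print(r)
```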

Graph 4: Total Steps vs Calories

This graph shows whether there is a correlation between the number of steps users took and the total calories they burned.

# Create a scatter plot to show the relationship between calories and steps
ggplot(data = daily_activity)+ geom_smooth(mapping = aes(x = TotalSteps, y = Calories)) + geom_point(mapping = aes(x = TotalSteps, y = Calories)) + labs(title = "Calories vs Total Steps")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Graph 5: Time in Bed vs Time Asleep

This graph shows the correlation between time in bed and minutes asleep.

# Create a scatter plot showing the correlation between time in bed and minutes asleep
ggplot(data = sleep_activity, aes(x=TotalMinutesAsleep, y = TotalTimeInBed)) + geom_smooth(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) + geom_point() + labs(title = "Sleep vs Time in Bed", x = "Total Minutes Asleep", y = "Total Time in Bed")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
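The gap between the two variables is the time users spend in bed without sleeping. A small sketch, using only the six rows from `head(sleep_activity)` so that it runs on its own (the column name `MinutesAwakeInBed` is introduced here for illustration):

```r
# Six rows copied from the head(sleep_activity) output (one user only)
sample_sleep <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep, per sleep record
sample_sleep$MinutesAwakeInBed <-
  sample_sleep$TotalTimeInBed - sample_sleep$TotalMinutesAsleep
print(sample_sleep)
```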

6. Act

Recommendations for Bellabeat Marketing Team:

  1. The more very active minutes a user records, the more calories they burn. Bellabeat should send notifications about staying active and reward users for recording activities.
  2. The more steps a user takes, the more calories they burn. Bellabeat should set step-based goals that users can reach by tracking their steps.
  3. Users are showing a healthy level of sleep. Bellabeat should send notifications that encourage keeping a consistent sleep schedule.
  4. The pie chart shows that users are sedentary most of the time. Bellabeat should send more notifications throughout the day reminding users to stay active.